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Background/Motivation 


NASA HEC supports at least 5 pseudospectral applications: 


Spherical Geometry 


Cartesian Geometry 

• DYNAMO 


• HPS 

• MoSST 

• ASH 


• DDSCAT 
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All except ASH use a 1 D 
decomposition: 

• limits scalability/performance 

• constrains grid resolution 


Spherical Geometry 

• DYNAMO 

• MoSST 

• ASH 


Cartesian Geometry 

• HPS 

• DDSCAT 







Consequences of 1 D decomposition 



Scaling Legendre Transforms 




1 D Radial 



1 D Wavenumber 


2D Extrapolation 



# cores 


# cores 
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Other motivations 


Pseudospectral methods have an elegant structure that provides quite interesting 
challenges from a software design perspective 
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Other motivations 


Pseudospectral methods have an elegant structure that provides quite interesting 
challenges from a software design perspective 

• Alternate between local computation and all-to-all communication 

• Complicated data structures (harmonic truncation) 

• Nontrivial load-balance 

• Most numerical calculations can be done with vendor-optimized libraries 
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Solution - SpF 

SpF (Spectral Framework) is a software framework tailor designed to maximize the 
performance and scalability of pseudospectral applications. 
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Benefits of adopting SpF 



• Less duplication of effort 

• Parallel “transforms” — Legendre, LU Decomposition, etc. 

• Tedious/fragile transpose implementations 

• Reduced effort to exploit new architectures/accelerators 

• Readily adopt/share performance innovations within the community 
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Challenges and Complications 
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SpF: The Secret Sauce 

Each transform can be expressed as a union of independent, atomic (indivisible) 
numerical ‘kernels’ {Ki, K2 , . . . , K„}. 
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SpF; Key Software Abstractions 

• Kernel - Indivisible unit of algorithm 
• Most user customization is here 
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SpF; Key Software Abstractions 

• Kernel - Indivisible unit of algorithm 

• Most user customization is here 

• Permutor - Automatic generaWon of communication logic for global transposes 

• Distributor - Responsible for domain decomposition (including load balance) 

• IndexSpace - Describes arbitrary data layout within/across PEs 

• Field - Local array and associated IndexSpace 

• Lingua franca within the framework 

• Input/output for kernei 

• FieldList - Local collection of Field objects 

• In/out for permutation 

• LinearSolver - Implicit updates 

• Integrator - time integration (CN, AB, ...) 
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SpF; Implementation details 


• Object-oriented design (ala Fortran 2003) 

• Applications built by extending SpF abstractions 

• User-extensions that can be shared by community 

• Aggressive use of test-driven development (TDD) & 
pFUnit 

• More that 300 unit tests 

• Runs on at least 3 compilers (Intel, GNU, NAG) 

• Demonstrated with multi-layer shallow water 

• Not quite ready for distribution 



How SpF sees an application 

Transform Permute Transform 
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F 
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IndexSpace - Cartesian Example 


1 

2 

3 

4 

5 

6 

7 

8 
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use SpFjnod 

class (IndexSpace) :: cartesian 

type (RangeAxis) :: xAxis, yAxis, zAxis 

xAxis = RangeAxis( ’x’ , nx) 
yAxis = Range Axis(’y’ , ny) 
zAxis = RangeAxis(’z’ , nz) 

allocate(cartesian , source= xAxis*yAxis*zAxis) 
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IndexSpace - Cartesian Bundle 


1 

use Spfjnod 

2 

class (IndexSpace) :: cartesianBundle 

3 

type (RangeAxis) :: xAxis, yAxis, zAxis 

4 

R 

type (String Axis) :: qtys 

vJ 

6 

xAxis = RangeAxis( ’x’ , nx) 

7 

yAxis = Range Axis(’y’ , ny) 

8 

zAxis = RangeAxis(’z’ , nz) 

9 

10 

qtys = StringAxis(’qty’, [’W, ’Z’ , ’S’, ’P’]) 

11 

allocate (cartesianBundle , source= & 

12 

& xAxis*yAxis*z Axis* qty s ) 
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IndexSpace - Triangular Truncation 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 


use SpFjnod 

class (IndexSpace) :: tDomain 
class (OuterProductSpace) :: modeAxis 
type (RangeAxis) : : rAxis 

modeAxis = RangeAxis ( ’m’ , 0, 0) * RangeAxis (’ ell ’ , 0, Lmax) 
Allocate (tDomain, source^node) 

do m = 1 , mlVfex 

modeAxis = RangeAxis ( ’m’ , m, m) * RangeAxis (’ ell ’ , m, Lmax) 
allocate (tDomain, source= tDomain + modeAxis) 
end do 

allocate (tDomain, source= RangeAxis( ’ r ’ ,nn)*tDomain) 
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Automating the transpose 


First we translate the index space into a labelled table: 


1 

m 

r 

/ 

0 

0 
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’S’ 

1 

0 

1 

’S’ 

10 
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10 

’Z’ 
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’W’ 
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’W’ 

0 

12 

2 

P’ 
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Automating the transpose 


Then we append process and offset metadata: 


£ 

m 

r 

/ 

PE 



£ 

m 

r 

/ 

PE 


0 

0 

1 

’S’ 

0 

0 


1 

2 

3 

’W’ 

0 

0 

1 

0 

1 

’S’ 

0 
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1 

2 

4 

’W’ 

0 

1 

10 

7 

10 

’Z’ 

8 

15 


0 

12 

2 

’P’ 

8 

15 
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Automating the transpose 


Then we append process and offset metadata: 



0 0 1 'S’ 0 0 3 7 

1 0 1 'S' 0 1 10 1 


10 7 10 'Z' 8 15 2 9 
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Automating the transpose 


For a 2-phase (nested) tranpsopse, we append the rank for each phase 


i m r f 

PEo PEi S 


i m r f 

PEo PEi S 

0 0 1 ’S’ 

1 0 1 ’S’ 

0 0 0 
0 0 1 

7 2 3 ’W’ 

7 2 4 ’W’ 

0 0 0 
0 0 1 

10 7 10 T 

8 2 15 


0 12 2 P’ 

8 2 15 
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DYNAMO 


Author: Gary Glatzmaier 



Primary configuration 

• Azimuthal wavenumbers 
distributed over PEs 

• Constraint Np <= 

• Supports variant spectral 
truncations and variant 
hyperviscosity terms 
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MoSST 


Author: Weijia Kuang 



Primary configuration 

• One dimensional Distribution 
over PEs at all stages 

• Constraint: Np < Nm 

• Constraint: Np < Nr 

• Constraint: All Spherical 
transforms (Legendre and 
FFT) are in the same process 
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Adopting SpF - general strategy 


1 Establish regression tests and data for baseiine. 

• I n vest i n ac h i evi n g strong reproducibility 

• Turn off optimization and turn on debugging flags 

2 Proceed with incremental changes that preserve results 

3 Commit to repository after each success. 

4 Minor roundoff issues may be encountered 

• Isolate cause, then update baseline regression data 

• Bracket change in repository 
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Adopting SpF - copy to/from legacy data 
structures 


1 Declare a FieldList object 

2 Create a procedure that copies an array into a Field 

3 For each contiguous array 

1 Define corresponding IndexSpace domain object 

2 Call appendO method on FieldList 

3 insert call to copy procedure just prior to use 
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Adopting SpF - Kernel Factory 

1 Create a new module: 

1 Define a derived type that extends KernelFactory 

2 Implement methods that compute Kernel IndexSpace (I/O) 

3 Define a derived type that extends Kernel 

4 Implement apply() method that wraps actual computation 

2 Declare and initialize in main code: 

1 new Factory defined above 

2 Distributor, Permutor 

3 Tasklist, and 2 FieldLists (in and out) 

4 Build task list, and field lists using distributor and factory 

5 Build permutor object connecting previous transform to new 

3 Use in main loop: 

1 Insert call to apply() method of Tasklist object 
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Adoption status 

MoSST 

• Now uses SpF permutations 
DYNAMO 

• SpF conversion completed for 

• Legendre transforms 

• Quadratic convolution 

• Stream to vector (i.e. {W, Z,...} — ^ {vr, ve, . . .}) 

• Permutations (including to/from legacy layout) 

• Took 1 week for expert (me) 

• Lots of ugly shortcuts 

• Issues encountered with implicit update step 

• Could “cheat” 

• Will use experience to instead improve framework 
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Example - top declaration 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 


type (SimpleMpiDistributor) :: d 

type (FieldList) :: leg_in, leg_out, NL_in, NL_out 

type (LegendreFactory) : : legFactory 

type (NL_ConvolutionFactory) : : NL_Factory 

type (Partitioned Algorithm) :: legTasks, NL_tasks 

type (SimpleMpIPermutor) : : perm 

class (IndexSpace) :: initialDomain 

d = SimpleMpiDistributor (MPI_communicator) 
legFactory = LegendreFactory (mlVfe 1023) 

NL_Factory = NL_ConvolutionFactory(ni , nk) 
initialDomain = ... 
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Example - initialization 


1 

2 

3 

4 

5 

6 

7 

8 
9 


legTasks = (distribute (legFactory , initialDomain) 
leg_in = FieldList(legTasks, ’ in ’ ) 
leg_out = FieldList (legTasks, ’ out’ ) 

NL_tasks = (distribute (NL_Factory, leg_in) 

NL_in = FieldList (NL_tasks, ’ in ’ ) 

NL_out = FieldList (NL_tasks, ’ out ’ ) 

perm = SimpleMpiPermutor(MPI_communicator, leg_out, NL_in) 


T. Clune SpF - DPSIC 26/31 





Example - execute 


1 

2 

3 

4 

5 


call legTask^pply(leg_in , leg_out) 
call pem%)ermute(leg_out, NL_in) 
call NL_task^pply (NL_in, NL_out) 
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Variations 
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Variations 


1 

2 

3 


! Alternate load balancing strategy 
! type (SimpleMpiDistributor) :: d 
type (RoundRobinDistributor) : : d 


T. Clune 


SpF - DPSIC 28/31 




Variations 


1 

2 

3 


! Alternate load balancing strategy 
! type (SimpleMpiDistributor) :: d 
type (RoundRobinDistributor) : : d 


5 

! Alternative permutation strategy 


6 

! type (SimpleMpiPermutor) : : perm 


7 

type (SomeOtherPermutor) : : perm 
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Variations 


1 

2 

3 


! Alternate load balancing strategy 
! type (SimpleMpiDistributor) :: d 
type (RoundRobinDistributor) : : d 


5 

! Alternative permutation strategy 


6 

! type (SimpleMpiPermutor) : : perm 


7 

type (SomeOtherPermutor) : : perm 



9 

10 

11 


! Alternative Legendre implementation 
! type (LegendreFactory) : : legFactory 
type (AltLegFactory) : : legFactory 
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Next steps 


• Finish ports of DYNAMO, MoSST, HPS, DDSCAT 

• Improve framework 

• Generalize/optimize Permutor classes 

• Allow for multiple sources 

• Allow for “subsetting” 

• Implement multiphase transpose (ala Nick Featherstone) 

• Extend/improve kernels 

• Better mechanism for defining offsets 

• Allow for multiple sources/destinations 

• Allow for “fat” kernels that do internal communication (e.g. implicit treatment of 
Coriolis) 

• Release SpF as open source 
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Questions? 


T. Clune 


SpF - DPSIC 31/31 



